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Abstract 

The use of mutual information as a similarity measure in agglom- 
erative hierarchical clustering (AHC) raises an important issue: some 
correction needs to be applied for the dimensionality of variables. In 
this work, we formulate the decision of merging dependent multivariate 
normal variables in an AHC procedure as a Bayesian model compar¬ 
ison. We found that the Bayesian formulation naturally shrinks the 
empirical covariance matrix towards a matrix set a priori (e.g., the 
identity), provides an automated stopping rule, and corrects for di¬ 
mensionality using a term that scales up the measure as a function of 
the dimensionality of the variables. Also, the resulting log Bayes factor 
is asymptotically proportional to the plug-in estimate of mutual infor¬ 
mation, with an additive correction for dimensionality in agreement 
with the Bayesian information criterion. We investigated the behavior 
of these Bayesian alternatives (in exact and asymptotic forms) to mu¬ 
tual information on simulated and real data. An encouraging result was 
first derived on simulations: the hierarchical clustering based on the log 
Bayes factor outperformed off-the-shelf clustering techniques as well as 
raw and normalized mutual information in terms of classification ac¬ 
curacy. On a toy example, we found that the Bayesian approaches 
led to results that were similar to those of mutual information cluster¬ 
ing techniques, with the advantage of an automated thresholding. On 
real functional magnetic resonance imaging (fMRI) datasets measur¬ 
ing brain activity, it identified clusters consistent with the established 
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outcome of standard procedures. On this application, normalized mu¬ 
tual information had a highly atypical behavior, in the sense that it 
systematically favored very large clusters. These initial experiments 
suggest that the proposed Bayesian alternatives to mutual information 
are a useful new tool for hierarchical clustering. 

Keywords: agglomerative hierarchical clustering; Bayesian model 

comparison; BIG; mutual information; multivariate normal distribu¬ 
tions; normalized mutual information. 


1 Introduction 


Cluster analysis aims at uncovering natural groups of objects in a multivari¬ 


ate dataset [see Jain (2010) for a review]. In the vast variety of methods 


used in cluster analysis, an agglomerative hierarchical clustering (AHC) is 
a generic procedure that sequentially merges pairs of clusters that are most 
similar according to an arbitrary function called similarity measure, thereby 


generating a nested set of partitions, also called hierarchy (Duda et al. 


2000). The choice of the similarity measure indirectly defines the shape of 


the clusters, and thus plays a critical role in the clustering process. While 
this choice is guided by the features of the problem at hand, it is also of¬ 
ten restricted to a limited number of commonly used measures, such as the 


Euclidean distance or Pearson correlation coefficient (D’haeseleer, 2005). In 


the present work, we focus on the clustering of random variables based on 
their mutual information, which has recently gained in popularity in clus¬ 


ter analysis, notably in the field of genomics (Butte and Kohane, 2000 


Benjaminsson et ah, 2010 Kolchinsky et ah, 2014). Mutual information is a 


Zhou et al. 

2004; 

Dawy et al., 2006; Priness et al. 

2007 

) and in functional 

magnetic resonance imaging (fMRI) data analysis 

(Stausberg et al. 

2009 


general measure of statistical dependency derived from information theory 


(Shannon 1948 Kullback, 1968; Cover and Thomas, 1991). A key feature 


of mutual information is its ability to capture nonlinear interactions for any 
type of random variables (Steuer et al., 2002); also of interest, it indifferently 


applies to univariate or multivariate variables and can thus be applied to 
clusters of arbitrary size. Yet, mutual information is an extensive measure 
that increases with variable dimensionality. In addition, I, the finite-sample 
estimator of mutual information, suffers from a dimensionality-dependent 
bias (see Appendix 0. Several authors have proposed to correct mutual 
information for dimensionality by using a “normalized” version of mutual 
information (Li et al^ [2001 Kraskov et al. 2005; Kraskov and Grassberger 


2009). In the clustering literature, normalized mutual information is rou¬ 


tinely used. However, the impact of such correction procedure has not been 
extensively evaluated so far. 


In the present paper, we consider Bayesian model-based clustering (Scott 
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as an alternative to mutual information for the hierarchical clustering of 
dependent multivariate normal variables. Specifically, we derive a similarity 
measure by comparing two models: Aii where Xi and Xj are indepen¬ 
dent (i.e., the covariance between any element of Xi and any element of 
Xj is equal to zero), against Md where the covariance between Xi and 
Xj can be set to any admissible value. The proposed similarity measure is 


then the log Bayes factor in favor of Md against Xtj (Kass and Raftery 


1995).With appropriate priors on the model parameters, we show that the 
similarity measure s(Xi, Xj) between Xi and Xj can be expressed in closed 
form. As will be shown below, the Bayesian formulation naturally (1) al¬ 
lows for clustering even when the sample covariance matrix is ill-defined; 
(2) provides for an automated stopping rule when the clustering reached 
has s{Xi,Xj) < 0 for any pair of remaining clusters; (3) corrects for di¬ 
mensionality using a term that scales up the measure as a function of the 
dimensionality of the variables; and (4) provides for a local and global mea¬ 
sure of similarity, in that it can be used to decide which pair of variables to 
cluster at each step (local level) as well as to compare different levels of the 
resnlting hierarchy (global level). Asymptotically (i.e., when the number of 
samples —)• oo), the similarity measure is a linear fnnction of mutual in¬ 
formation, with a penalization factor that is in agreement with the Bayesian 
information criterion (BIG) (Schwarz, 1978). In this sense, the present pa¬ 
per makes an explicit connection between Bayesian model comparison for 
the clustering of dependent random variables and mutnal information. The 
code corresponding to the Bayesian approach is freely available onlin^ 

We evaluated an AHC procedure based on this approach with synthetic 
datasets. The experiment aimed to evaluate how it behaved under both its 
exact and asymptotic forms compared to other approaches, inclnding raw 
and normalized mutual information. We finally tested the new measures on 
two real datasets: a toy dataset and functional magnetic resonance imaging 
(fMRI) data. 


2 Analysis 

In the following, we develop a Bayesian solution to the problem of cluster¬ 
ing detailed above. We first introduce the model together with the Bayesian 
framework and a general expression for the similarity measure. In sub¬ 
sequent subsections, we derive a closed-form expression for the marginal 
model likelihoods under both assumptions of dependence and independence 
as well as exact and asymptotic expressions for the similarity measure. We 
then provide a description of the hierarchical agglomerative clnstering algo¬ 
rithm resulting from the present development. We examine how the same 

^https://github.com/SIMEXP/arXiv-1501.05194/releases/tag/l.0 
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framework can be conveniently used to compare nested partitions, that is, 
different levels of a hierarchy. We also deal with the issue of setting the 
hyperparameters. Finally, we show how the Bayesian solution can naturally 
provide for an automatic stopping rule. 

2.1 Bayesian model comparison 

Let W be a D-dimensional multivariate normal variable with known mear[3 
fj. and unknown covariance matrix S. Define Xi and Xj as two disjoint 
subvectors of X (of size Di and Dj, respectively), and X^jj as their union 
(of size Diuj = Di + Dj). Assume that we have to decide whether we 
should cluster Xi and Xj based on their level of dependence. To this 
end, consider two competing models Aij and Aio- In -M-i, Xi and Xj are 
independent and the distribution of Xnjj can be decomposed as the product 
of the marginal distributions of Xi and Xj. Under such condition, Hiujj 
the restriction of S to Xnjj, is block diagonal with blocks Sj and Sj, the 
restrictions of 51 to Xi and Xj, respectively. In by contrast, Xi and 
Xj are dependent. Given a dataset {xi,..., X]\f} of N independent and 
identically distributed (i.i.d.) realizations of X and S the corresponding 
sample sum-of-square matrix 


N 

S = ^(Sn - fji){Xn - /l)^ 

n=l 


we propose to quantify the similarity between Xi and Xj as the log Bayes 
factor, that is, the log ratio of the marginal model likelihoods of Ain versus 
Ml 


s(Xi,Xj) = In 


piSiujlMn) 

p(SiujlMi) ■ 


( 1 ) 


Each marginal model likelihood can be expressed as an integral over the 
model parameters as described below. 


^For the sake of simplicity, we assumed a known mean in the following theoretical 
development. If the mean is unknown (as will be the case in the simulation and real data 
sections), this development is still valid, with N replaced by — 1 and S by 


N 

^(a:„ - m){xn - mf, 

n = l 


where m is the sample mean 



n=l 
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2.2 Marginal model likelihood under the hypothesis of de¬ 
pendence 

Let us first calculate P{Si\jj\M.£))^ the marginal model likelihood under the 
hypothesis of dependence. Expressing this quantity as a function of the 
model parameters yields 


P(‘S'iUj I_D) — / p('S'iUj ) P(^iUj d5]juj ■ 


( 2 ) 


Calculation of the integral requires to know the likelihood p{Si\jj\M.D, ^iuj) 
and the prior distribution p(I]juj|M_D) of the covariance matrix under Ain- 
With multivariate normal data, Snjj given 


with N degrees of freedom and scale matrix (Anderson 
lary 7.2.2), leading to the following likelihood 


is Wishart distributed 
Corol- 


2003 


p(S'iui|M D ) '^iUj ) 




Z{Di\jj, N) 


-‘iUj \ 


exp 


-1 




jSiUj) 


(3) 


where Z{d,n) is the inverse of the normalization constant 


Z{d, 


n 


nd d{d—l) 

= 2 2 TT 4 


n> 

d'=i 


n + 1 — d' 


As to the prior distribution, this quantity is here set as a conjugate prior, 
namely an inverse-Wishart distribution with degrees of freedom and 
inverse scale matrix Ajuj 


(Gelman et ah, 2004, §3.6) 




HUj I 


^iUj 

2 


Z{Diijj ) ^iuj ) 


■‘iUj \ 


2 exp 




(4) 


Bringing Equations and @ together into Equation (§ yields for the 
marginal model likelihood 


p(S'iujlAfD) 


I . I I ^ I ^ ^iUj ^ 

I ^iUj I ^ I iUj I ^ 

Z{Diuj ^ N) Z(^Diuj , Uiuj ) 

X / |Siuj| 2 exp 



(Ajuj + Siijj)'S 


-1 

iUj 


dS 


iUj- 


The integrand is proportional to an inverse-Wishart distribution with N + 
Viuj degrees of freedom and scale matrix A^uj + Snjj, leading to 


p(5iuj|Afn) = 


I'S^iujl 2 Z{Djuj , N + ViDj ) 
Z{Diijj , N) Z{Diijj , I'iijj) 


I * I 

I -^iUj I ^ 


I ^iUj “i” -^iUj I ^ 


(5) 
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2.3 Marginal model likelihood under the hypothesis of inde¬ 
pendence 

We can now calculate P{Si\jj\Aii), the marginal model likelihood under the 
hypothesis of independence. If At/ holds, then Hjuj is block-diagonal with 
two blocks Sj and Sj the submatrix restrictions of Hjuj to Xi and Xj, 
respectively. Introduction of the model parameters therefore yields for the 
marginal likelihood 

p{Siuj\Mi) = j p{Siuj\Mi,'Si,'Sj)p{'Si,'Sj\Mi)d'Sid'Sj. ( 6 ) 


To calculate this integral, we again need to know the likelihood p{Si\jj\Xii, 5]j, Sj) 
and the prior distribution p(5]j, of the two blocks of the covariance 

matrix under AI/. The likelihood is the same as for Mb and has the form 
of Equation Q. Furthermore, since Xljuj is here block diagonal, we have 
|Sjuj| = |Si| |Sj| and tr(Xlj^^^-5iui) = tr(E~^Si) + tr(i:~^Sj), where Si and 
Sj are the restrictions of S to Xi and Xj, respectively. Consequently, the 
likelihood can be further expanded as 


Xlj, Uj) 


X) 


n 1^-= 

ke{i,j} 


N 

2 exp 




(7) 

As to the prior distribution, assuming no prior dependence between Xlj and 
Sj yields 

p(S„Sj|Al/) = p(Si|Al/)p(Sj|Al/). (8) 


For the sake of consistency, p(Sj|AI/) and p(Sj|AI/) are set equal to 
p(Si|AlD) and p(Sj|Al£)), respectively, which are in turn obtained by 
marginalization of p(Sjuj|AlD) as given by Equation Q. For k G {i,j}, 
p(Sfc|Al 7 ) can be found to have an inverse-Wishart distribution with = 
k'iuj — Diuj + Dk degrees of freedom and inverse scale matrix A*, the restric¬ 
tion of Ai\jj to Xk (Press, 2005, §5.2) 


p(Sfc|AI/) = 


(Press 

2005 

|A 

k\ 2 


^fc+-PA:+^ 

2 exp 


--tr(AfcS;, ) 


(9) 


Incorporating Equations ([^, ([^, and (|^ into Equation ([^ yields 




Siuj\' ^ 

Z{Diljj, N) 


n 


-^k 


2 


Z{Dk,Uk) 
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Each integrand is proportional to an inverse-Wishart distribution with N + 
Uk degrees of freedom and scale matrix Sk + leading to 


p(S,ui|Af/) = 


I ^iUj I ^ 

Z{Di\jj, N) 


n 

ke{i,j} 


Z{Dk, N + Vk) 


Yk 


|Ai,| 2 


Z{Dk,Vk) |5fc + Afc|^' 


2.4 Log Bayes factor of dependence versus independence 

Let us now express the Bayesian similarity measure by incorporating Equa¬ 
tions ([^ and ( fTo] ) into Equation Q, yielding 


s{Xi,Xj) = 




X -\- Vi\jj 


with 


cst = ^ln|Aiuj| + ^ 


D, 


Uj r 


d=l 


InE 


X i^iuj -|- 1 — d 


la IA juj -|- Si\jj I -|- cst 

( 11 ) 

k'iijj + 1 — d 


-InE 


( 12 ) 


Dfr 


E t>“I^‘I + E 


fcefij} 


d=i 


InE 


X I'k ^ ~ d 


+ lnT 


Vk + l- d 


Another form for s{Xi,Xj) is 

s(Xi^ XXcpi 

with 

A4>k = (j){X + idk,Ak + Sk) - 0(z2fc, Afc), k £ {i,j,iUj}, 

and 


dim(A) 


(j){n, A) = -ln|A|-|- InE 


d=i 


n + 1 — d 


(13) 


(14) 


(15) 


A(pk quantifies, up to a constant that cancels out in s{Xi, Xj), the amount 
by which the data support a model of multivariate normal distributions with 
unrestricted covariance matrix for Xk- 


2.5 Asymptotic form of the log Bayes factor 

We can now provide an asymptotic expression for s{Xi,Xj). Define Sk as 
the standard sample covariance matrix, i.e., Sk = XSk- When N —)■ oo, we 
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can use Stirling approximation ( Abramowitz and Stegun| |1972[ p. 257) to 
expand the expression (j) of Equation (15), leading to (see Appendix [B|) 


s{Xi,Xj) = 


2 \Siuj\ 


DiUjjDiUj + 1) 


= Ni{Xi,Xj)-^^ 


lnA^ + 0(l), 


DkjDk + 1 ) 
^ " 2 


lnA^ + 0(l) 
(16) 


where ^ ^ 

i(Xi,X,) = |lnS^ 

^ PiUj'l 

is the plug-in estimator of mutual information for a multivariate normal 
distribution. Alternatively, NI{Xi,Xj) can be seen as the minimum dis¬ 
crimination information for the independence of Xi and Xj (Kullback, 1968 
Chap. 12, §3.6). 


2.6 Hierarchical agglomerative clustering 


A hierarchy on a set of D variables is a nested set of partitions (C/)^q, 


where C/ is a partition of {1,..., D} into D — I clusters (Duda et al., 2000). 


A hierarchical agglomerative clustering (AHC) is a generic procedure to 
generate such a hierarchy, outlined in pseudo-code in Algorithmic The main 
steps of the algorithm are: (1) Initialize the partition with singletons (line 2); 
(2) derive a matrix s; where each element represents the similarity between 
two clusters Xi and Xj of C;, based on an arbitrary function s{Xi,Xj) 
(line 4); (3) identify the two clusters that are most similaijC (line 5); (4) 
form a new partition identical to the previous one, except that the two most 
similar clusters are now merged (lines 6-7); (5) iterate Steps 2-4 until the 
partition has only one single element which covers the whole set of variables 
(line 3). In the case of the methods proposed here, the similarity measure 
is given by either Equation 0 for the exact formulation or Equation ( |29[ ) 
for the asymptotic BIC approximation. 


2.7 Comparing distinct levels of the hierarchy 

The last development aims at providing a way to compare nested partitions, 
i.e., different levels of the hierarchy. Once the hierarchical clustering has 
been performed, each step is associated with a partition of X. Assume 

^There may be more than one pair of clusters which maximize the similarity function. 
Most implementations of AHC proceed by selecting arbitrarily one such pair (e.g., the 
first one to be detected). In the in-house implementation we used, the pair was selected 
randomly amongst all these pairs. This was done to properly capture the instability of 
the algorithm. In such a form, AHC may not be deterministic anymore, in that two runs 
of the same algorithm on the same dataset may result in different hierarchies. 
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1: return Hierarchy (Ci)Pn 

2: Co^{{X,},...,{Xd}} 

3: for I = 0,..., D — 2 do 
4: Si ^ 

5: ^ argmaxjj Si{iJ) 

6: Ci+iCi\{i* ,j*} 

7: Ci+i^Ci\j{i*\Jj*} 

8 ; end for 

Algorithm 1; General description of the hierarchial agglomerative clus¬ 
tering algorithm. 


that, at level I, the partition reads {Xi,..., Xk} and that, at step I + I, 

Xi and Xj are clustered. Denote by Ci the assumption that the partition 
at level I is the correct partition of X. The global improvement brought by 
the clustering of Xi and Xj between steps I and I -|- 1 can be quantified by 
the log ratio between the marginal model likelihoods 

p(^^ 

p{S\Ci) ’ 

where both quantities can be computed in a manner similar to what was done 
for the similarity measure. For instance, if Ci is true, then S is block-diagonal 
with K blocks Hfc’s, the submatrix restrictions of 51 to Xk- Introducing the 
model parameters then yields 

K 

p(S|C0 = / i>{S\Ci, Si,..., p(Si,..., Si^ICz) n dSfc. (17) 

k=i 

In a way similar to what was done previously, the likelihood p(S^|C/, Si,..., S/^) 
can be expanded as 

N-D-l K 

_ _ ^ 2 •%—p iV 

p{S\Ci, Si,..., Si^) = exp 

Turning to the prior distribution p(Si,... ,Si^|C;) and assuming no prior 
dependence between the S^’s, we can set 

K 

p(Si,..., Sii-|Ci) = ]^ p(Sfc|Ci). (19) 

fc=i 

Each p(Sfc|C;) can be obtained by the marginalization of p(S|C/), which 
is here taken as an inverse-Wishart distribution with v degrees of freedom 
and inverse scale matrix the D-hy-D diagonal matrix A. Note that such 
a prior on S is compatible with the prior used earlier for Sjuj if one sets 


1—11 


2tr(S^ Ofcj 


( 18 ) 
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t'iuj = ly — D + Diuj and if Ajyj is the restriction of A to X^jj (Press 
§5.2). We then have 


2005 


p(Sfc|Cz) = 


I A 1-^ 

|Afc| 2 , 
Z{Dk,i'k) 


^fc+-Pfc + ^ 

2 exp 


--tr(AfcS^^) 


( 20 ) 


Incorporating Equations (18), (19), and (20) into Equation ( [ItI ) and inte¬ 
grating leads to 




I A 


Z{D,N) 


k=l 


^{^ki ^k) 


N+uu 

H + Sk\-^ 


( 21 ) 


The same calculation can be performed for p(S'|Cz_|_i). The result is the same 


as in Equation (21), except that the product is composed of A — 1 terms. 


Of these terms, K — 2 correspond to clusters that are unchanged from Ci 


to Cz+i and, as a consequence, are identical to those of Equation (21). The 


[K — l)th term corresponds to the cluster obtained by the merging of Xi 
and Xj. As a consequence, the log Bayes factor reads 


In 


P(g|C^+i) 

p(-^ICz) 


= In 


Z{Diyjj,N -|- t'iuj) 


HUj 


2 


, l^iuj ) I ^ 


iUj “1“ Si\jj I 


iV+i/, 


'iuj 


Z] I'' 

k£{i,j} 


I A 1^ 


Z(^Dki k'k') 


I . Cl I 

Afc -|- Sk\ 2 


But this quantity is nothing else than s{Xi, Xj). In other words, we proved 
that 


In^^Z^ = s{Xi,X 


3h 


( 22 ) 


v{S\Ci) 

i.e., s{Xi, Xj), the local measure of similarity between Xi and Xj, can 
be used to compute the global measure of relative probability between two 
successive levels of the hierarchy. 


2.8 Setting the hyperparameters 

Hyperparameter selection is often a thorny issue in Bayesian analysis. We 
here considered two approaches. The first approach (coined BayesCov) is 
to set the degree of freedom to the lowest value that still corresponds to 
a well-defined distribution, that is v = D, and a diagonal scale matrix 


that optimizes the marginal model likelihood of Equation (21) before any 


clustering (that is, with K = D clusters and Dk = 1 for all fc), yielding (see 
Appendix 

AcZd — Sdd, 
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where and S^d are the diagonal elements of the prior inverse scale ma¬ 
trix A and sum-of-square matrix S, respectively. An alternative approach 
(coined BayesCorr) is to work with the sample correlation matrix instead 
of the sample covariance matrix. One can then set the number of degrees 
of freedom to = D -I- 1 and the scale matrix to the identity matrix. The 
corresponding prior distribution yields uniform marginal distributions for 
the correlation coefficients (Barnard et ah, 2000). Note that clustering with 


the asymptotic form of Equation (|29|) (coined Bic) does not involve hyper¬ 
parameters; it is also insensitive to the fact that the input is the covariance 
matrix or the correlation matrix. 


2.9 Automatic stopping rule 


An advantage of the Bayesian clustering scheme proposed here and its BIC 
approximation is that they come naturally with an automatic stopping rule. 
By definition of s in Equation (Q, the fact that s(XjQ, Xj^) > 0 for the pair 
that is selected for clustering also means that the marginal model likelihood 
for Ad^) is larger than that for Aij. As such, Ajp and are more likely to 
belong to the same cluster than not and, as a consequence, it indeed makes 
sense to cluster them. By contrast, if we have s(XiQ, Xjg) < 0 for the same 
pair, the marginal model likelihood for Ad^ is smaller than that for Ad/. 
Xifj and Ajp are therefore more likely to belong to different clusters. If, 
at a given step of the clustering, the pair with highest similarity measure 
has a negative similarity measures, then all pairs do, meaning that all pairs 
of clusters tested more probably belong to different clusters. It therefore 
makes sense to stop the clustering procedure at that point. This shows that 
an automatic stopping rule can simply be implemented into the clustering 
algorithm: Stop the clustering if the pair (Xi^^jXj^) selected for clustering 


at a given step has s(Ajp, Xj^) 


< 0. Note that, according to Equation (22), 


the resulting clustering corresponds to the one in the hierarchy that has 
largest marginal likelihood. We will refer to BayesCovAuto, BayesCorrAuto 
and BicAuto when applying the clustering schemes with this automated 
stopping rule. 


3 Results 

3.1 Validation on synthetic data 

To assess the behavior of the method expounded here, we examined how 
it fared on synthetic data. We used the two variants of the Bayes fac¬ 
tor mentioned above (BayesCov and BayesCorr), Bic, as well as the same 
methods with automatic stopping rule (BayesCovAuto, BayesCorrAuto and 
BicAuto). As a mean of comparison, we also used the following methods— 
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• A random hierarchical clustering scheme, where variables were clus¬ 
tered uniformly at random at each step. This category contains only 
one algorithm: Random, which was implemented for the purpose of the 
present study. 


• Hierarchical clustering schemes with similarity measures given by ei¬ 
ther Pearson correlation or absolute Pearson correlation, and a merg¬ 
ing rule based on either the single, average, or complete linkage, or 
using Ward criterion. This category contains 8 algorithms: Single, 
Average, Complete, Ward, SingleAbs, AverageAbs, CompleteAbs, WardAbs. 
We used the implementations of these methods proposed in NIAK|^ 

• Hierarchical clustering with a similarity measure given by mutual in¬ 
formation, with and without normalization. This category contains 
2 algorithms: Infomut and InfomutNorm. These methods were im¬ 
plemented for the purpose of the present study. Note that neither 
algorithm can run on small samples. 


An approach where the clusters are estimated as the blocks of the 
precision (i.e., inverse covariance) matrix estimated with the graphical 
lasso—essentially a maximum-likelihood with Li-norm penalization 
(Friedman et ah, 2008). The penalization parameter A was set in [0,1] 


by step of 0.1, then to 5, 10, 20, 50, 100, 200 (Smith et al. 

2011) 

A ver 

sion that optimizes A with a BIC criterion was also used 

(Lian 

2011) 


Since the graphical lasso is not invariant by transformation of a covari¬ 
ance matrix into a correlation matrix, we used either the covariance 
matrix or the correlation matrix as input. Note that this approach 
automatically determined a number of clusters. Also, for A = 0 (un¬ 
constrained case), the algorithm cannot run on small samples. This 
category contains 34 algorithms: 16 algorithms gLassoCov-x and 16 
algorithms gLassoCorr-x, where x is the value of A, and 2 algorithms 
gLassoCovOpt and gLassoCorrOpt. For this category of algorithms, 
we used a package freely availably and already used in (Smith et al. 


2011 ). 


An approach based on the spectral clustering (von Luxburg 2006) of 
either the raw value or the absolute value of either the correlation 
or the partial correlation matrix. This approach required the num¬ 
ber of clusters as input. Since this clustering has a step of A:-means, 
which is stochastic by nature, we considered 2 variants: one with 1 
step of A:-means and the other one with 10 repetitions of /c-means 
and selection of the best clustering in terms of inertia. The similarity 
measures were defined so that the range would be the same (namely 


"^https: //github. com/SIMEXP/niak 

®http: //www. cs .ubc . ca/~schmidtm/Software/Llprecision.html 
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in [0,1]) when using the signed or the absolute value of correlation: 
0.5(1 + r) and |r|, respectively. This category contains 8 algorithms: 
2 algorithms SpecCorr-x, 2 algorithms SpecCorrAbs-x, 2 algorithms 
SpecCorrpar-x and 2 algorithms SpecCorrparAbs-x, where x is the 
number of times that fc-means is performed. The spectral clustering 
algorithms of this category were coded for the purpose of the present 
study, while we used the implementations of the fe-means algorithm 
proposed in NIAK. 

All in all, 59 variants were tested. 


Data description. In order to assess the performance of the Bayesian 
approach, we performed the following set of simulations. For each value of D 
in {6,10, 20, 40}, we considered partitions with increasing number of clusters 
C (1 < C < D). For a given value of C, we performed 500 simulations. For 
each simulation, the D variables were randomly partitioned into C clusters, 


all partitions having equal probability of occurrence (Nijenhuis and Wilf 


1978, chap. 12); (Wilf, 1999). For a given partition {Xi, 
generated data according to 


K 


fix) = fkiXk), 


,Xk} of X, we 


(23) 


k=l 


where all fkS were taken either as multivariate normal distributions (pa¬ 
rameters: mean and covariance matrix S^) or multivariate Student-t 
distributions (parameters: degres of freedom i/, location parameter and 
scale matrix S^). In both case, the p^’s were set to 0 and the Xl^’s were 
first sampled according to a Wishart distribution with Dk + 1 degrees of 
freedom and scale matrix the identity matrix and then rescaled to a corre¬ 
lation matrix. The sampling scheme on generated correlation matrices 
with uniform marginal distributions for all correlation coefficients ( ]Barnard| 
|2000). For the multivariate Student-t distributions, u was set in 


et ah 


{1,3, 5}. Equation (23) was used to generate synthetic datasets of length N 
varying from 10 to 300 by increment of 40. Each dataset was summarized 
by its sample correlation matrix and hierarchical clustering was performed 
using the methods mentioned above. All simulations were implemented us¬ 
ing the Pipeline System for Octave and Matlab, PSOMpj ( [Bellec et al. 2012) 
under Matlab 7.2 (The MathWorks, Inc.) and run on a 24-core server. 

To assess the efficiency of the various methods, we thresholded each clus¬ 
tering at the true number of clusters, except for BayesCovAuto, BayesCorrAuto, 
BicAuto and gLasso for which we used the clustering obtained by the 
method. We then quantified the quality of the resulting partition using 
the proportion of correct classifications as well as the adjusted Rand index, 


“https://github.com/SIMEXP/psom 
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which computes the fraction of variabie pairs that are correctiy ciustered 
corrected for chance (Rand, 1971} Hnbert and Arable, 1985). Resuits corre¬ 
sponding to a given dimension D and a given method were then pooled across 
nnmbers of clusters C, lengths N and distribntions (mnltivariate normal and 
Stndent). We classified the methods from best to worst based on these global 
resnlts using the following indices (in this order): median of adjusted Rand 
index, 25% percentile value of adjusted Rand index, 5% percentile value of 
adjusted Rand index, smallest value of adjusted Rand index, and proportion 
of correct classifications. Note that some algorithms (Inf omut. Inf omutNorm 
and SpecCorrpar) require the sample covariance matrix to be definite pos¬ 
itive. As a consequence, these algorithms could not run on small samples. 
We therefore restrained our evaluation to cases where all algorithms were op¬ 
erational. Finally, we performed a Bayesian ANOVA-like regression analysis 
(Gelman et ah, 2004[ §15.6), where we explained the adjusted Rand index 


of nine algorithms (BayesCov, BayesCovAuto, BayesCorr, BayesCorrAuto, 
Bic, BicAuto, Infomut, InfomutNorm, and AverageAbs) with the follow¬ 
ing effects: clustering algorithm (9 levels), number of variables D (4 levels: 
D G {6,10,20,40}), type of distribution (4 levels: multivariate Gaussian or 
multivariate Student-t with 1, 3, or 5 degrees of freedom), number of sam¬ 
ples N (8 levels: N G {10,50,90,130,170,210,250,290}), and number of 
clusters C (68 levels: from 2 to H — 1 clusters for each D). In other words, 
the model considered was of the form 


adj Rand index(algo, H, distr, N, C) = /3o+/3aigo+/3D+/3distr+/?Af+/?£>,c+e- 

(24) 

Interactions between effects, while potentially relevant, were not considered 
to keep the analysis tractable. The posterior distribution for the various 
regression parameters were estimated nsing Gibbs sampling. 


Results. The results corresponding to the adjusted Rand index and frac¬ 
tion of correct classification are summarized in Figs and for the 20 
best methods. Globally, and as expected, all methods were adversely af¬ 
fected by an increase in the number of variables. In all cases, the variants 
proposed in the present paper performed very well compared to other meth¬ 
ods. BayesCov and BayesCorr were always classified as the two best al¬ 
gorithms and Bic was never outperformed by a method already published. 
The methods with automatic thresholding of the hierarchy performed sur- 
prinsingly well, considered that they were compared against methods with 
oracle. In particular, they clearly outperformed all variants of gLasso, the 
only method that was able to automatically determine the nnmber of clns- 
ters. Of note, all variants of gLasso proved too slow to be applied to our 
simulation data for D G {20,40}. 

The results of the regression analysis are represented in Fig The 9 
algorithms selected inclnded the ones proposed in the present manuscript 
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Figure 1: Simulation study. Computational time (top), adjusted Rand 
index (middle) and proportion of correct classifications (bottom) for D = 6 
(left) and L) = 10 (right). 
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Figure 2: Simulation study. Computational time (top), adjusted Rand 
index (middle) and proportion of correct classifications (bottom) for D = 20 
(left) and D = AO (right). 
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(BayesCov, BayesCovAuto, BayesCorr, BayesCorrAuto, Bic, and BicAuto), 
the best-performing classic algorithm in the previons analysis (AverageAbs) 
as wel as the algorithms based on mutual information (Inf omut and Inf omutNorm). 
We found that increasing the dataset size (N) increased the performance of 
the algorithm, while increasing the dimensionality of the problem (D) and 
the number of clusters (C) decreased it. Note that dimension was found 
to have a negative influence on the adjusted Rand index, even though this 
index was partly proposed as a modification from the raw Rand index to 
overcome this limitation. Finally, this analysis confirmed the superior be¬ 
havior of the methods proposed here, even when the method included the 
automatic stopping rule. 


Toy example 


This data set was first used in Roverato (1999) and later re-analyzed in Mar- 


relec and Benali (2006) in the context of conditional independence graphs. 


It originates from a study investigating early diagnosis of HIV infection 
in children from HIV positive mothers. The variables are related to vari¬ 
ous measures on blood and its components: Xi and X 2 immunoglobin G 
and A, respectively; X 4 the platelet count; A 3 , A 5 lymphocyte B and T4, 
respectively; and Ag the T4/T8 lymphocyte ratio. The sample summary 
statistics are given in Table Roverato (1999) found that the correla¬ 


tions between A 4 and any other variable had strong probability around 
zero and hypothesized that the model was overparametrized. Based on 
the simulation study, we performed clustering of the data with the fol¬ 
lowing methods: BayesCov(Auto), BayesCorr (Auto), Bic(Auto), Infomut, 
InfomutNorm, SingleAbs, AverageAbs, CompleteAbs, WardAbs, SpecCorrAbs 
and SpecCorrparAbs. For spectral clustering, we used either 1 or 10 repe¬ 
titions of the /c-means step; since the results obtained for 1 step of fc-means 
were highly variable for 3, 4, and 5 clusters, we discarded these results. 


Table 1: Toy example. Summary statistics for the HIV data: sample vari¬ 
ances (main diagonal), correlations (lower triangle) and partial correlations 


(upper triangle) [from Roverato (1999)]. 


Xl 

8.8374 

0.479 

-0.043 

-0.033 

0.356 

-0.236 

X2 

0.483 

0.1919 

0.068 

-0.084 

-0.224 

- 0.110 

X3 

0.220 

0.057 

8924231.9 

0.085 

0.552 

-0.330 

X 4 

-0.040 

-0.133 

0.149 

20392.4 

0.091 

0.013 

X5 

0.253 

-0.124 

0.523 

0.179 

1952795.2 

0.384 

xq 

-0.276 

-0.314 

-0.183 

0.064 

0.213 

1.378 


Xl 

X2 

X 3 

X 4 

Xb 

Xq 


The resulting clusterings are given in Fig [^ and Table All hierar¬ 
chical clusterings started by clustering A 3 and A 5 (lymphocite). This was 
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Figure 3: Simulation study. Result of the regression analysis. Posterior 
distribution for the different regression coefficients: /3o (global effect), /^at 
( dataset size A^), /3aigo (algorithm), /3z) (number of variables D), /3distr (type 
of distribution), and /3 d, c (number of clusters) [see Equation ([24|)]. 
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confirmed by SpecCorrAbs-10 when required to provide 5 clusters. All hier¬ 
archical clustering methods then clustered Xi and X 2 (immunoglobin). This 
result was in agreement with both SpecCorrAbs-10 and SpecCorrparAbs-10 
when required to provide 4 clusters. After the second step, we observed four 
behaviors for the AHC algorithms and classified them accordingly: 

(Gl) BayesCov, BayesCorr, Infomut and InfomutNorm; 

(G2) SingleAbs and AverageAbs; 

(G3) Bic; 

(G4) CompleteAbs and WardAbs. 

While not a hierarchical clustering, SpecCorrAbs-10 provided successive 
clusterings that were in agreement with methods in (G 2 ). Algorithms in 
(Gl) and (G3) clustered Xq with {X^jX^}, creating a cluster of variables 
related to lymphocite. Algorithms in (Gl) and (G2) (and SpecCorrAbs-10) 
agreed that a partitioning of the variables in two clusters should leads to 
{Xi, X 2 , X 3 , X 5 , Xq} on the one hand and {A 4 } on the other hand. This 
clustering was also found by SpecCorrpar-10 when constrained to extract 
two clusters from the data. It was also considered the best clustering for 
BayesCov and BayesCorr. For Bic, the optimal partitioning was composed 
of three clusters, namely, {Xi, X^, Xq}, {Xi,X 2 }, and {X 4 }, which is in 
agreement with what would methods in (Gl) yield for three clusters; fur¬ 
thermore, it still kept X/j^ separated from the other variables. 

In Fig we represented the evolution of the log^g Bayes factor during 
hierarchical clustering for BayesCov, BayesCorr and Bic. Note that, while 
the clustering steps are identical for BayesCov and BayesCorr, the log Bayes 
factors are similar but not identical. Likewise, while the first two clustering 
steps of Bic is identical to those of BayesCov and BayesCorr, one can 
see that, unlike BayesCov and BayesCorr, Bic considered the merging of 
{X 3 ,X^} with Xq almost as likely as that of Xi with X 2 . Also, while the 
successive clusterings of X 3 with X 5 and then Xq as well as that of Xi with 
X 2 strongly increased the Bayes factor for BayesCov and BayesCorr, the 
improvement brought by the clustering of {X3, X^, Xq} with {Xi,X2} in 
these methods was less important. 

All in all, this analysis led us to the following conclusion: it is very likely 
that variables Xi and X 2 belong to the same cluster of dependent variables, 
and similarly for variables A 3 and Xq. Also, there is very strong evidence in 
favor of the fact that A 4 is independent from the other variables. Finally, we 
suspect that A 3 , A 5 and Xq could belong to the same cluster of variables. 


3.2 fMRI data 


Gluster analysis is a popular tool to study the organization of brain net¬ 
works in resting-state fMRI (Yeo et ah, 2011; Kelly et ah, 2012), by identi- 
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(G2): SingleAbs, 
AverageAbs 


3 5 6 1 2 4 


(G4): CompleteAbs, 
WardAbs 


3 5 6 1 2 4 

Figure 4: Toy example. Result of clustering. Algorithms in the top row 
clustered X 4 at the last step, while it was clustered at the before the last step 
for algorithms in the bottom row. Algorithms in the left column clustered 
Xq with {AsjXs}, while Xq was clustered with {Xi,X 2 } for the algorithms 
in the right column. Parts in grey correspond to clustering steps that were 
not performed by BayesCovAuto or BayesCorrAuto in (Gl), or Bic in (G3). 


Table 2; Toy example. Result of spectral clustering with increasing num¬ 
ber of clusters. 



SpecCorrAbs-10 

SpecCorrparAbs-10 

2 clusters 

12356 1 4 

3 clusters 

126 1 35 1 4 

12 1 356 1 4 

4 clusters 

12 1: 

35 1 4 1 6 

5 clusters 

1 1 2 1 35 1 4 1 6 

not reproducible 




(Gl): BayesCov, BayesCorr, 
Infomut, InfomutNorm 



3 5 6 1 2 4 


(G3): Bic 


20 



BayesCorr 


BayesCov 


Bic 


15 


10 


15 


10 


356124 356124 356124 


Figure 5: Toy example. Result of clustering for BayesCorr, BayesCov 
and Bic. The values on the y axis correspond to the log^g Bayes factor in 
favor of the global clustering obtained at each step compared to a model 
where all variables are independent (step 0 of hierarchical clustering). The 
dotted lines correspond to clustering steps that were not performed with the 
automatic stopping rule. 


fying clusters of brain regions with highly correlated spontaneous activity. 
We applied the 13 methods that were found to have good performance on 
simulations (see Fig[^ to a collection of resting-state time series. The time 
series had 205 time samples and were recorded for 82 brain regions in 19 
young healthy subjects. See Appendix for details on data collection and 
preparation. The data are available onlin^ 

We first aimed at establishing which clustering algorithms yielded sim¬ 
ilar results on these datasets. We more specifically investigated a 7-cluster 
solution, as this level of decomposition has been examined several times 


2010 

Power et al. 

2011 

Yeo et al.| 

2011 


Each clustering algorithm was applied to the time series of each subject 
independently. For a given pair of methods, the similarity between the 
cluster solutions generated by both methods on the same subject was eval¬ 
uated with the Rand. Note that the raw, unnormalized Rand index was 
used here, as we did not compare cluster solutions with different numbers 
of clusters, which is the main motivation of the normalization. The unnor¬ 
malized Rand index has a more intuitive interpretation than its adjusted 
counterpart. The Rand indices were averaged across subjects, hence re¬ 
sulting into a method-by-method matrix capturing the (average) similarity 
of clustering outputs for each pair of methods (see Fig . An AHC with 
Ward’s criterion was applied to this matrix in order to identify clusters of 
methods with similar cluster outputs. We visually identified five clusters 
of methods that had high (> 0.7) average within-cluster Rand index. The 


^http: //f igshare . com/articles/Atlanta_resting_state_fMRI_time_series 
_preprocessed_using_the_AAL_template/1521155 
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largest cluster included classical AHCs such as CompleteAbs, AverageAbs, 
WardAbs, as well as the Bic and BayesCov methods proposed here. It should 
be noted that this class of algorithms generated solutions for this prob¬ 
lem that were very close to one another (average within-cluster Rand index 
> 0.8). The BayesCorr method was also close to that large group of meth¬ 
ods, but not quite as much as the aforementioned methods (average Rand 
index of about 0.7), and was thus singled out as a separate cluster. The 
spectral methods were split into two clusters, depending on whether they 
were based on correlation (SpecCorrAbs-1 and SpecCorrAbs-10) or partial 
correlation (SpecCorrparAbs-1 and SpecCorrparAbs-lO). Finally, the two 
variants of mutual information (Infomut and InfomutNorm) generated so¬ 
lutions that were highly similar to SingleAbs. It was reassuring that the 
variants of Bayes methods proposed here performed similarly to algorithms 


known to produce physiologically plausible solutions, such as Ward (Thirion 


et al. 

2014 

Orban et al. 

in press 


While we found some analogy between 
BayesCorr, BayesCov, Infomut and InfomutNorm, it was intriguing that the 
variants of mutual information tested seemed to generate markedly different 
classes of solutions from the Bayes methods. We decided to examine the 
cluster solutions of these methods in more details. 


average inter-method 
Rand index 


0 0.2 0.4 0.6 0.8 1 



inter-method 
clustering (5 clusters) 


SingleAbs 

Infomut 

InfomutNorm 

CompleteAbs 

AverageAbs 

WardAbs 

Bic 

BayesCov 

BayesCorr 

SpecCorrAbs-1 

SpecCorrAbs-10 

SpecCorrparAbs-1 

SpecCorrparAbs-10 




— 

-1 


4 6 8 10 12 

clustering methods 


4 6 8 10 12 

clustering methods 


Figure 6: Real resting-state fMRI data—between-method similar¬ 
ity. Panel a: Rand indices between individual partitions generated with 
different methods, averaged across all subjects and scales (number of clus¬ 
ters). Panel b: Hierarchical clustering using matrix shown in Panel a as a 
similarity measure and Ward’s criterion. 


As a reference, we examined the cluster solutions generated by WardAbs, 
in addition to two variants of Bayes clustering that yielded slightly different 
solutions (BayesCov and BayesCorr), and nomalized mutual information, 
InfomutNorm. To represent the typical behavior of each method across sub¬ 
jects, we generated a “group” consensus clustering summarizing the stable 
features of the ensemble of individual cluster solutions. This consensus clus¬ 


tering was generated by the evidence accumulation algorithm (Fred and 
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Jain, 2005) outlined below. First, each partition of each subject was repre¬ 


sented as a binary 81-by-81 adjacency matrix A = (Aij), where for each pair 
(i,j) of brain regions, Aij = 1 if areas i and j were in the same cluster, and 
Aij = 0 otherwise. The adjacency matrices were then averaged across sub¬ 
jects and selected methods, yielding a 81-by-81 stability matrix C = (Cij) 
where each element Cij coded for the frequency at which brain areas i and j 
fell in the same cluster. Finally this stability matrix was used as a similarity 
matrix in a AHC with Ward’s criterion to generate one consensus partition. 
The brain regions were grouped into the same consensus cluster if they had 
a high probability of falling into the same cluster on average across subjects 
and methods, hence the name consensus clustering. 

Fig [7] represents the stability matrices and consensus clusters, for the 
four methods of interest. As expected based on our first experiment on 
the similarity of cluster outputs, the WardAbs and BayesCov methods were 
associated with highly similar stability matrices and almost identical con¬ 
sensus clusters. Many areas of high consensus could be identified (with 
values close of 0 or 1), illustrating the very good agreement of the cluster 
solutions across subjects. The outline of the consensus clusters as well as a 
volumetric representation of the brain parcellation are presented in Fig[^. 
Some of these clusters closely matched those reported in previous studies: 
cluster 7 can be recognized as being the visual network, cluster 2 the senso¬ 
rimotor network, and clusters 6 and 3 the anterior and posterior parts of the 
default-mode network, respectively ([Salvador et al. 2005 


van den Heuvel 


et al. 2008 Bellec et al. 2010). By contrast with WardAbs and BayesCorr, 


the Inf omutNorm tended to generate very large clusters, which was appar¬ 
ent both on the stability matrix and the consensus clusters. The BayesCorr 
method was intermediate between BayesCov and Inf omutNorm in terms of 
cluster size. These decompositions into very large clusters do not fit current 
views on the organization of resting-state networks. 

Overall, our analysis on real fMRI data led to the following conclusions. 
Three big subsets of methods emerged: spectral methods, mutual infor¬ 
mation (with SingleAbs), and finally all the other methods. Application 
of this last group of methods, which included the Bayes variants proposed 
here, resulted in a plausible decomposition into resting-state networks. In 
the absence of ground truth, it is not possible to further comment on the rel¬ 
evance of the differences in cluster solutions identified by the three groups of 
methods. We still noted that the methods based on mutual information led 
to large clusters that were difficult to interpret. Our interpretation is that 
the strategies implemented in Infomut and InfomutNorm did not behave 
well for these datasets. 

As a final computational note, the time required by the different meth¬ 
ods to cluster data is summarized in Table Although the differences in 
execution time may reffect the quality of the implementation, the methods 
proposed here were the slowest of the hierarchical methods, but were still 
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Figure 7: Real data—consensus clustering. A consensus clustering 
was generated based on the average adjacency matrices across all subjects 
(Panel a). The (weighted) adjacency matrix associated with the consensus 
clustering is represented along with a volumetric brain parcellation (Panel 
b). The weights in the adjacency matrix were added to establish a visual 
correspondence with the volumetric representation. Note that the brain 
regions have been order based on the hierarchical clustering generated with 
WardAbs. 
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faster than spectral clustering. 


Table 3: Real resting-state fMRI data—computational cost. Time 
required by each method to cluster one dataset. * For nonhierarchical meth¬ 
ods, we summed the times used to perform clustering at each scale. 

method minimum median maximum 


CompleteAbs 

10.9 ms 

11.5 ms 

24.5 ms 

AverageAbs 

11.0 ms 

11.6 ms 

25.2 ms 

SingleAbs 

11.1 ms 

11.7 ms 

49.5 ms 

WardAbs 

14.1 ms 

14.8 ms 

18.9 ms 

InfomutNorm 

159 ms 

170 ms 

261 ms 

InfoMut 

299 ms 

322 ms 

416 ms 

Bic 

666 ms 

673 ms 

810 ms 

BayesCov 

1.083 s 

1.094 s 

1.263 s 

BayesCorr 

1.108 s 

1.151 s 

1.133 s 

SpecCorrparAbs-1* 

1.928 s 

2.065 s 

2.291 s 

SpecCorrAbs-1* 

1.992 s 

2.176 s 

2.417 s 

SpecCorrparAbs-10* 

13.251 s 

13.939 s 

14.301 s 

SpecCorrAbs-10* 

14.456 s 

15.381 s 

16.101 s 


4 Discussion 

4.1 Contributions 

Summary. We here proposed some novel similarity measures well suited 
for the agglomerative hierarchical clustering of dependent variables. These 
measures rely on a Bayesian model comparison for multivariate normal ran¬ 
dom variables. On synthetic data with a known (ground truth) partition, 
hierarchical clustering based on the Bayesian measures was found to outper¬ 
form several standard clustering procedures in terms of adjusted Rand index 
and classification accuracy. On the toy example, the Bayesian approaches led 
to result similar to those of mutual information clustering techniques, with 
the advantage of an automated thresholding. On the real fMRI data, the 
Bayesian measures led to results consistent with standard clustering meth¬ 
ods, in contrast to methods based on mutual information, which exhibited 
a highly atypical behavior. 

Bayesian normalization of mntual information. A key feature of the 
Bayesian approach is its ability to take the dimension of the clusters into ac¬ 
count. Dimensionality is an important issue in two respects (see Appendix [A| 
for an illustration). First, mutual information is an extensive measure that 
depends on the dimension of the variables. This has motivated the intro¬ 
duction of a normalization factor in the application of mutual information 
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to hierarchical clustering (Kraskov et al., 2005; Kraskov and Grassberger 


2009). A second issue is the existence of a bias in the estimation of mutual 


information. This bias mechanically increases with the dimensionality of the 
variables. Because of the two issues described above, hierarchical clustering 
based on mutual information will tend to cluster unrelated but large vari¬ 
ables rather than correlated variables of lower dimensions. As demonstrated 


on real fMRI data, the heuristic proposed by (Kraskov et al. 2005) does not 


provide a general solution to the issue of dimensionality. Furthermore, it 
removes some interesting features of mutual information, such as the addi¬ 
tivity of the clustering measure. By contrast, the Bayesian approach takes 
the dimensionality into account in a principled way, providing a quantitative 
version of Occam’s factor (Jaynes, 2003, Chap. 20). The Bayesian normal¬ 


ization is additive rather than multiplicative, thus preserving the additive 
properties of mutual information. 


Prom similarity measure to log Bayes factor. We defined the simi¬ 
larity measure s{Xi, Xj) between any two pairs of variables Xi and Xj as 
a log Bayes factor. At each step, the pair {Xi^^Xj^) that had the largest 
similarity measure was merged. Taking into account the unique features of 
s as the log Bayes factor defined in Equation ([T) allowed us to have access 
to a global measure of fit as defined in Equation ( ^ as well as to provide an 
automatic stopping rule that behaved very well on simulated data. Going 
from a similarity measure to a log Bayes factor has other advantages that 
could take the clustering proposed here even further (see below). 


Practical value of the Bayes/Bic clustering in fMRI. The new al¬ 
ternatives to mutual information introduced in this paper (i.e., Bayes and 
Bic) proved useful for the analysis of resting-state fMRI. The benefits were 
particularly clear when compared to Inf omutNorm, which tended to create 
large, inhomogeneous clusters. By contrast, both Bayes and Bic had a be¬ 
havior similar to standard hierarchical clustering based on Pearson’s linear 
correlation. The possible benehts of Bayes and Bic over those canonical 
methods are still substantial. The mutual information first provides a mul¬ 
tivariate measure of interaction that is well adapted to hierarchical brain 


decomposition (Tononi et al. 1994; Marrelec et ah, 2008) and which has a 


clear interpretation in information theory (Watanabe, 1960; Joe| 1989; Stu-| 
deny and Vejnarova, 1998). For these reasons, the mutual information for 


Gaussian variables is more appealing than a simple average of pairwise cor¬ 
relation coefficients across clusters. Because mutual information is measured 
between clusters, it is natural to build the clusters themselves based on this 
metric. A second beneht of the proposed approach is that Bic proved to be 
the most stable of all tested methods in the range of 5-15 clusters on real 
fMRI datasets. The increase in stability over Ward’s was modest, but signif- 
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icant. This advantage may become even more substantial if the clustering is 
performed in higher dimension, i.e., with smaller areas than the AAL brain 
parcellation or even at the voxel level. 


Similarity vs. distance. Clustering techniques are based on either a sim¬ 
ilarity measure or a distance measure. While the description of the present 
manuscript mostly relied on the notion of similarity, going from one concept 
to the other one can generally be done with minor changes. For instance, 
standard hierarchical procedures which rely on the minimization of a dis¬ 
tance to perform clustering (e.g., single, complete and average linkage) can 
be applied to cases where closeness is quantified by a measure of similarity, 
simply by using the opposite of the similarity matrix as a distance matrix. 
Although the resulting measure may not define a mathematically valid dis¬ 
tance, it is not required for the procedure to work. Similarly, in a Bayesian 
framework, switching from a similarity measure to a distance measure tan- 
tamounts to switching from 


s{Xi,Xj)=ln 


piSiujlMo) 


to 


d{Xi,Xj) 


-s{X^,Xj)=ln 


piSiujlMo)' 


that is, from the log ratio of the marginal model likelihoods in favor of 
dependent variables to the log ratio of the marginal model likelihoods in 
favor of independent variables. 


4.2 Modeling choices 


Choice of priors. Any Bayesian analysis requires the introduction of prior 
distributions. In the present study, we needed the prior distribution for 
the covariance matrix associated to any clustering of X. Our choices were 
guided by one assumption, one rule of consistency, and one rule of simplic¬ 
ity. First, our assumption was to not assume a priori any sort of dependence 
between covariance matrices associated to different clusters. This allowed 
for the decomposition of any prior as a product of independent priors, see 
Equations (l8|) and (19). The rule of consistency was to consider that the 


prior for a given covariance matrix should not be contradictory at different 
levels of the hierarchy. This is why we set the prior distribution for the 
global covariance matrix S as an inverse-Wishart distribution and then de¬ 
rived the prior for any covariance matrix as the prior distribution for 
S integrated over all parameters that do not appear in Sfc; using such an 
approach, the resulting prior turned out to be an inverse-Wishart distribu¬ 
tion as well. Last, the rule of simplicity is the one that dictated the use of 
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inverse-Wishart distributions as prior distributions for the covariance ma¬ 
trices. Such a family of priors had the twofold advantage of being closed by 
marginalization and allowing for closed-form expressions of the quantities 
of interest. An inverse-Wishart distribution is characterized by two param¬ 
eters: the degrees of freedom v and the inverse scale matrix A. If S is a 
D-hy-D matrix, we must have i/ > D — 1 for the distribution to be normal¬ 
ized. Also, A must be positive definite. From there, we had two strategies. 
The first one was to set the degree of freedom to the lowest value that still 
corresponded to a well defined distribution {v = D) and a diagonal scale 
matrix that optimized the marginal model likelihood. An alternative ap¬ 
proach was to work with the sample correlation matrix, set v = D + 1 and 
equal A with the D-hy-D identity matrix 7, since this choice corresponds 
to a distribution that is associated with uniform marginal distributions of 
the correlation coefficients (Barnard et al., 2000). While we believe that our 
assumption and the rule of consistency are sensible choices, we must admit 
that we are not quite as content with the choice of inverse-Wishart distri¬ 
butions for priors. The major issue with such a family of priors is that they 
simultaneously constrain the structure of correlation and the variances. By 
contrast, it seems intuitive that clustering should depend on the correlation 
structure only, not on the variances. As such, the prior on the variances 
should ideally be set independently from the correlation structure. Priors 
that separate variance and correlation have already been proposed (]Barnard| 


et al., 2000). Unfortunately, the use of such priors would make it impossible 
to provide a closed form for the marginal model likelihood. While numeric 
schemes could be implemented to circumvent this issue, it would render the 
procedure proposed much more complex and computationally burdensome. 
By contrast, the algorithm detailed here is rather straightforward. Also, 
the influence of the prior vanishes when the sample size increases. From a 
practical perspective, the three methods proposed here (two, BayesCov and 
BayesCorr, with different priors and one, Bic, not influenced by the prior) 
exhibited similar behaviors and still outperformed other existing methods in 
the simulation study. We take it as a proof of the robustness of the method 
to the choice of prior. 


Covariance vs. precision matrix modeling. The presence of clusters 
of variables that are mutually independent is equivalent to having a covari¬ 
ance matrix 51 that is block diagonal, which is itself equivalent to having 
a precision (or concentration) matrix T = that is also block diagonal. 
One could therefore solve the problem using precision matrices instead of 
covariance matrices. The corresponding calculations can be found in Ap¬ 
pendix The main difference between the two approaches stems from the 
fact that a submatrix of the inverse covariance matrix is not equal to the 
inverse of the corresponding submatrix of the covariance matrix, that is. 
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(A~^)fc 7 ^ (Afc)“^, unless A is block diagonal; also, Wishart and inverse- 
Wishart distributions do not marginalize the same way. These differences 
are to be related to the fact that a submatrix of a covariance matrix is better 
estimated than the whole covariance matrix, while the same does not hold 
for a precision matrix. From there, we could expect the covariance-based 
approach to perform better than the precision-based approach, and the dif¬ 
ference to increase with increasing D. This was confirmed on our synthetic 
data, where the precision-based approach behaved as well as BayesCov and 
BayesCorr for D = Q but had worse results than these two approaches for 
D £ {10,20,40}. Besides, performance of the automated stopping rule was 
much more efficient with BayesCov and BayesCorr than it was with the 
precision-based approach. As a final note, basing the method on concen¬ 
tration matrices yielded a slower algorithm, arguably because of the matrix 
inversions that are required. 

Sample covariance matrix vs. full dataset. Intuitively, the structure 
of dependence of a multivarite normal distribution is included in its covari¬ 
ance matrix. All existing algorithms that we used here do not need the 
full dataset but only the sample covariance (or correlation) matrix. This is 
why we started with a likelihood model that only considers the covariance 
matrix [see Equations Q and 0]. Rigorously, this model is only valid for 
N > D] for N < D, one should resort to a model of the full data as being 
multivariate normal with a mean p and covariance matrix S. Nonetheless, 
we kept the ’intuitive’ approach as it has fewer hyperparameters, is easier 
to deal with, and leads to formulas that are simpler to interpret. Also, the 
resulting similarity measure [Equation 0] is not restricted to N > D, but 
is well defined for N < D as well. From a practical perspective, the ’intu¬ 
itive’ algorithm only requires the sample covariance matrix as input, while a 
full model would also require the sample mean. Finally, this simpler model 
exhibited good behavior on our synthetic data, even for small datasets. 

4.3 Directions for future work 

Computational costs. Regarding the computational cost, measures de¬ 
rived from mutual information or a Bayesian approach are more demanding 
than standard methods such as Average or Ward. The derivation of the 
similarity matrix and the search for the most similar pairs of clusters are 
the two critical operations that can be optimized to decrease computation 
time. It is always possible to speed up these two steps by taking advan¬ 
tage of the fact that the similarity matrices of two successive iterations of 
the algorithm have many elements in common, as all but one element of 
a partition at a given iteration are identical to the elements of the parti¬ 
tion of the previous iteration. Critically, in the case of Average and Ward 
methods, it is in addition possible to derive the similarity matrix at every 


29 


iteration only based on the similarity matrix at initialization through succes¬ 


sive updates using the Lance-Jambu-Williams recursive formula (Batagelj 


1988). By contrast, other measures, including BayesCov, BayesCorr, Bic, 


Infomut, and Inf omutNorm, have to be re-evaluated independently at ev¬ 
ery step, which means that the determinant of a covariance matrix with 
increasing size has to be computed. Finding an update formula analogous 
to Lance-Jambu-Williams for clustering methods based on variants of the 
mutual information would substantially accelerate these algorithms. 


Prom deterministic to stochastic clustering. Another extension would 
be to replace the deterministic rule of selecting the pair with largest simi¬ 
larity measure for merging by a probabilistic rule where the probability to 
cluster a given pair is given by the posterior probability of the resulting 
clustering. 


Group analysis. A last extension that could be easily implemented in 
the present framework is the generalization of the method to account for 
E different entities (e.g., subjects in fMRI) sharing the same structure but 
with potentially different covariance matrices for that structure. If each 
entity e is characterized by a variable and corresponding sample sum- 
of-square matrix one can perform AHC on each X'^^^ using and 
the corresponding similarity measure However, with a straightforward 
modification of the present method, one could also perform global AHC 
of all E covariance matrices considered simultaneously. Assuming that the 
covariance matrices of the different elements are independent given the com¬ 
mon underlying structure, then the resulting similarity measure is the sum 
of all individual similarity measures s^^^’s. 


Generalization to other types of distribution. Altogether, the Bayesian 
framework that we used provides a principled way to generalize our approach 
to distributions other than multivariate normal ones. Such generalization 
would potentially account for a wide variety of situations, such as nonlinear 
dependencies or discrete distributions. This would widen the scope of pos¬ 
sible applications of the technique, e.g., genetics (Butte and Kohane 2000 


Zhou et al. 

2004 

Dawy et al. 

2006 


Priness et ah, 2007). The issues related 


to this generalization are twofold. First, one needs a model of dependence. 
In the discrete case, one could think of multinomial distributions, origi¬ 


nating from categorical, i.e., generalized Bernoulli distributions (Papoulis 
1965, §3.4). In the continuous case, multivariate normal distributions are 


a first choice model beyond which it is not clear what to use. Multivariate 
Student-t distributions could be considered, even though our results on sim¬ 
ulated data would tend to hint that the difference with multivariate normal 
distriubtions might not be that large. One could also consider using models 
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where dependence is controled independently of the marginal distributions, 
such as multivariate copulas (Nelsen, 1999 Fischer, 2011). Another issue is 


the possibility to express the marginal posterior likelihood of the data given 
the model selected. For multivariate discrete variables, we expect it to be 
feasible, albeit computationally very challenging and sensitive to the type 
of prior distribution. For other distributions, obtaining a closed form might 
be out of reach. Nonetheless, the marginal posterior likelihood could be 
approximated using various criteria, such as the AIC ( Akaike[ 1974) or vari¬ 


ants thereof—AICc (Burnham and Anderson 


1998, §2.3.1) or AICu (McQuarrie and Tsai 


2004); (McQuarrie and Tsai 


1998 §2.4.1)—, or the BIC 


(Schwarz, 1978), which naturally appeared in the present derivation. In any 
case, any approach beyond multivariate normal distributions would drasti¬ 
cally increase the complexity of our approach, both in terms of inference 
and computation. 


Application to truly hierarchical data. In the present manuscript, we 
used a hierarchical algorithm as a way to extract an underlying structure of 
dependence from data. Our assumption was that there was one such struc¬ 
ture. Such an approach provided a simple and efficient clustering algorithm 
with an interesting connection to mutual information. However, the method 
as presented here is not able to deal with data that are truly organized hi¬ 
erarchically. Extending it to deal with such data would improve the scope 
of the algorithm. One way to do would be to use Dirichlet process mixtures 


(Heller and Ghahramani 

2005 

Heller 

2007 

Savage et al. 

2009 

Gooke et al. 

2011 

Darkins et al. 

2013; Sirinukunwattana et al. 

2012 

), together with a 


model of dependent variables. 


5 Conclusion 

In this paper, we proposed a procedure based on Bayesian model compar¬ 
ison to decide whether or not to merge Gaussian multivariate variables in 
an agglomerative hierarchical clustering procedure. The resulting similar¬ 
ity measure was found to be closely related to the standard mutual in¬ 
formation, with some additional corrections for the dimensionality of the 
datasets. These new Bayesian alternatives to mutual information turned 
out to be beneficial to hierarchical clustering on Gaussian simulations and 
real datasets alike. Because of the simplicity of its implementation, its good 
practical performance and the potential generalizations to other types of 
random variables, we believe that the approach presented here is a useful 
new tool in the context of hierarchical clustering. 
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A Illustration of the behavior of estimated mutual 
information 

In the case of a D-dimensional variable X with a multivariate normal distri¬ 
bution, the mutual information between two subvectors Xi and Xj is given 
by 

/(X„X,) = Jln^i^, (25) 

where S^, k G U j}, is the covariance matrix of X^ and | • | is the 

usual determinant function. 
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Estimation bias. In this case, mutual information is estimated using its 
plug-in estimator /, that is, Equation (25) where the model covariance ma¬ 


trices have been replaced by their estimators (i.e., the corresponding sample 
covariance matrices). For large N, this estimator suffers from a systematic 


bias, as it was shown that (Marrelec and Benali, 2011) 


E{i) = I{Xi,Xj) + 


DiDj 

2N 


+ 0 


1 


This bias is additive and is a (quadratic) function of the problem dimen¬ 
sionality. This is consistent with the fact that 2NI{Xi, Xj) asymptotically 
follows a chi square distribution with DiDj degrees of freedom (Kullback 
Chap. 12, §3.6); (|P^[200^ §11.3.2). 


Extensivity of the measure. Besides the above problem of estimation, 
mutual information itself is an extensive quantity i.e., it mechanically in¬ 
creases with an increase in the dimensionality of the problem. To better 
demonstrate this effect, consider a model of homogeneous matrix for S. 
More specifically, let Au(p) be a D-hy-D homogeneous matrix with param¬ 
eter p, i.e., a matrix with Is on the diagonal and all off-diagonal elements 
equal to p. A£){p) has two eigenvalues: 1 + {D — l)p with multiplicity 1 
(associated with the vector composed only of Is) and 1 — p with multiplicity 
D — 1 (associated with the subspace of vectors with a zero mean). This 
covariance matrix is therefore positive dehnite for 0 < p < 1 and its deter¬ 
minant is given by [1 -|- (D — l)p](l — p)^~^■ Assuming that S = Ad{p), 
mutual information can be expressed as 

^ 1 [1 -I- {Dj - l)p][l -I- {Dj - l)p] 

2 (1 ~ p)[l + ~ l)p] 

where Dk, k G U j}, is the dimension of X^- This expression can 

lead to counter-intuitive results: the mutual information is about 0.31 for 
Di = Dj = 5 and p = 0.3, while it is about 0.33 for Di = Dj = 7 and 
p = 0.25. More generally, I{Xi,Xj) is an increasing function of Di and 
Dj (as can be seen by differentiation). This means that, for the same value 
of p, mutual information will be larger for larger values of Di and Dj. It 
also means that mutual information will favor the merging of two variables 
with smaller marginal correlations if the dimensionality of the variables is 
large enough, thus systematically favoring larger clusters. To give a feeling 
of the amplitude of this behavior, set = D^ — 1 and assume that we have 
D'j^p 3> 1. Mutual information can then be approximated by 




I{Xi,Xj) 


lln 

2 (1 - P)^'u, 


P 


= - In 

2 1 -p 


2 D' 
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B Asymptotic form of the log Bayes factor 


Assume that N ^ oo. Starting from the expression (p of Equation (15), we 
have 


(j){N + Uk, Afc + Sk) — — 


A + 


Dfr 


ln|Afc + Sk\ + In r 


d=i 


iV “hi — d 


^ (27) 

Defining Sk as the standard sample covariance matrix, i.e., Sk = NSk, the 
first term of the right-hand side can be expanded as 


N + Vk 


InlAfc + S'fcl = 


N + Uk 


ln|Afc + 


A ^ 
2^2 


Dfc In A + In |5fc| + In |/ + {NSkT^^k 


DuN N Duiyv. 

= -^lnA+-ln|5fc| + -^lnA + 0(l), 

since |aA| = for any positive number a and matrix A. In the 

expression of p in Equation (27), each term in the sum can be approximated 
using Stirling approximation (Abramowitz and Stegun, 1972, p. 257) 

Inr(z) = Inz — 2 : + 0(1). 

Setting z = (A -|- -|- 1 — d)/2 for f G {i,j, i U j}, we obtain 

j^ N + ., + l-d '^ ^ N + n-d jv - ^(1 + In 2) + 0(1). 

Summing this expression over d = 1,, Dk and using the fact that J2d=i ^ ~ 
Dk{Dk + l)/2 leads us to 


Du 


E'"r 


d=l 


N Uk 1 — d 


= Dk 


^^^^ InA- y(l + ln2) 


Dk{Dk + 1) 


InA-FO(l). 


We then have for p 

P{N+Uk,Kk + Sk) = -y ln|5fc|-^(l + ln2)-^^^^fcLllinA+0(l). 

(28) 
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Since (pi^k, A^) does not depend on N, it is 0(1), and the Taylor expansion 


of Afpk, as defined in Equation (14), is also given by Equation (28). Finally, 
the approximation for s{Xi,Xj) of Equation (11) is given by 


s{X„Xj) = ^InS^ 


= Ni{Xi,Xj)- 


1 

' 2 
DiDj 


Diuji^Diuj -|- 1 ) 


E 

ke{t,j} 


Dk{Dk + 1 ) 


lnA^ + 0(l), 


where we used the fact that Dnjj = Di + Dj, and where we set 


/(X,,X,) = ^ln^^ 

^ *^ 7:1 


lniV + 0(l) 
(29) 


’iuji 


C Hyperparameter optimization 

Given a clustering {Xi,..., Xk} ofAl, optimizing the marginal likelihood 
with respect to a diagonal A leads to the optimization of 

i' — D + Dk^ , . N + v — D Dk lie- I A I 

2^'-^-^InAdrf -2^- -- — ln|5fc + Afcl, 

d=l k=l 

where kd is the cluster containing Xd- Differentiation with respect to Add 
leads to 


v — D + Dkj^ 


N + 1 / — D + Dk^ 


2A 


dd 


[('S^fcd + Afc^) 


-11 


\dd 


= 0 . 


(30) 


To obtain the solution of this equation, we notice that the equation is equiv¬ 
alent to 


Add = 


V — D + Dk, 




-1 


N + V — D + Dk^ 

Optimizing at the first level (i.e., with D clusters and Dk^ = 1) yields 

Add — ■-- 'Jdd- 

D Real fMRI data: datasets and preprocessing 


We used the ’Atlanta’ resting-state fMRI database (Liu et ah, 2009). This 
resource was made publicly available as part of the 1000-connectome projecllj 


(Biswal et ah, 2010). The Atlanta sample includes 28 subjects (age ranging 


from 22 to 57 years, 15 women) with one structural MRI and one fMRI run 


®http://www.nitre.org/projects/fcon_1000/ 
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each (205 volumes, TR = 2 s), acquired on a 3 T scanner. The datasets were 
preprocessed using the neuroimaging analysis kil|^ (NIAK), version 0.6.5c 
(Bellec et al. 2012). The parameters of a rigid body motion were first esti¬ 
mated at each time frame of the fMRI dataset (no correction of inter-slice 
difference in acquisition time was applied). The median volume of the fMRI 
time series was coregistered with a Ti individual scan using Minctracc^*^ 


( [Collins et al.[ |1994| ), which was itself transformed to the Montreal Neuro¬ 
logical Institute (MNI) non-linear template (Fonov et ah, 2011) using the 
CIVETp^ pipeline ( jZijdenbos et al. 2002). The rigid-body transform, fMRI- 
to-Ti transform and Ti-to-stereotaxic transform were all combined, and the 
functional volumes were resampled in the MNI space at a 3 mm isotropic 
resolution. The “scrubbing” method of (Power et ah, 2012) was used to re¬ 
move the volumes with excessive motion (frame displacement greater than 
0.5). The following nuisance parameters were regressed out from the time 
series at each voxel: slow time drifts (basis of discrete cosines with a 0.01 Hz 
high-pass cut-off), average signals in conservative masks of the white mat¬ 
ter and the lateral ventricles as well as the first principal components (95% 
energy) of the six rigid-body motion parameters and their squares ( |Giove| 
et al., 2009). The fMRI volumes were then spatially smoothed with a 6 mm 


isotropic Gaussian blurring kernel. Because some of the measures considered 
in this paper are poorly conditioned when the number of spatial locations 
is larger than the number of time points (BIC, Infomut and Inf omutNorm), 
the fMRI time series were spatially averaged on each of the areas of the 
AAL brain template (Tzourio-Mazoyer et ah, 2002). To further reduce the 


spatial dimension, only the 81 cortical AAL areas were included in the anal¬ 
ysis (excluding the cerebellum, the basal ganglia and the thalamus). The 
clustering methods were applied to these regional time series. Note that 8 
subjects were excluded because there was not enough time points left after 
scrubbing (a minimum number of 190 volumes was selected as acceptable), 
and one additional subject had to be excluded because the quality of the 
Ti-fMRI coregistration was substandard (by visual inspection). A total of 
19 subjects was thus actually used in this analysis. 


®http://wiki.bic.mni.mcgill.ca/index.php/NiakFmriPreprocessing 
^®http://wiki.bic.mni.mcgill.ca/index.php/MinctraccManPage 
'^http: //wiki .bic .mni .mcgill. ca/index .php/CIVET 
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E Model comparison with the concentration ma¬ 
trix 


E.l Hypothesis of dependence 

Si\jj is Wishart distributed with N degrees of freedom and scale matrix 
Siuj = 




Huj\ 


Z iDi[jj,N) 


■ iUi 


K 

2 exp 


^tr(Tjuj‘S'iuj) 


The prior for Hjuj [Equation (4) of the manuscript] directly translates into a 
prior for T njj that is Wishart with i^njj degrees of freedom and scale matrix 
^iuj = 


p(TiuilAtD) 


I a 


iUj I 


2 


^iUj ^iUj 




■ i'Jj I 


exp 






T 


iUj) 


This leads to a marginal likelihood of 


piSiujlMo) 


^iUj \ 


in 


iUj\ 


2 


Z{Diuj , iV) Z(^Di\jj , i^iuj ) 

X / ITjujl 2 exp 


—tr 


-1 


+ ^iuj 


The integrand is proportional to a Wishart distribution with N+unjj degrees 
of freedom and scale matrix {Siuj + leading to 


p(5iuj|A4D) 


N-Diuj-i 

Siuj I 2 Z [Dnjj, N + unjj ) 

Z{Diijj, N) Z{Diijj , Uiijj ) 




-1 


2 


m 


■iUj 


2 


(31) 


E.2 Hypothesis of independence 

In the case of independence, the same likelihood holds with the addition 
that, since Tjuj is block diagonal with blocks Tj and Tj, we have |Tjuj| = 
|Tj| |Tj| as well as tr(Tiuj‘S'jui) = +tr(TjS^j), leading to a likeli¬ 

hood of 


p(S'iuj|AI/, Tiuj) 


gjuil-^ 

Z {Diijj,N) 


n 

k£{i,j} 


K 

2 exp 


-^tr(Tfc5A:) 


I dXiui- 
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The prior for that derives from that of Tjuj is a Wishart distribution 
with Viyjj degrees of freedom and scale matrix ilk 


-^tr(n^iTfc) 

This leads to a marginal likelihood of 


p(Tfc|Af,) = 


\ilk 






exp 


(Press, 2005, §5.1.2) 


p(5*ui|A^/) 


, N-Pujj-l 


Huj\ 




''iUj 

2 


Z{Diuj,N) Z{Di, Uiuj) Z{Dj, v^jj) 


n / nr* 

fce{i,i} 



tr [Tfc(5fc + n^^)] 


dXfc. 


Each integrand is proportional to a Wishart distribution with N + Uiuj 
degrees of freedom and scale matrix {Sk + leading to 


p{S^uj\Mi) 


Si\jj I 2 

Z{Diuj, N) 


n 

k&{i,j} 


Z{Dk, N + k'iuj) 

ZiyDki k'iyjj ) 


{St + ni')-' 




2 


2 


(32) 
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